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ABSTRACT 

The power of sparse signal coding with learned dictionaries has been 
demonstrated in a variety of applications and fields, from signal pro- 
cessing to statistical inference and machine learning. However, the 
statistical properties of these models, such as underfitting or overfit- 
ting given sets of data, are still not well characterized in the literature. 
This work aims at filling this gap by means of the Minimum De- 
scription Length (MDL) principle - a well established information- 
theoretic approach to statistical inference. The resulting framework 
derives a family of efficient sparse coding and modeling (dictio- 
nary learning) algorithms, which by virtue of the MDL principle, 
are completely parameter free. Furthermore, such framework al- 
lows to incorporate additional prior information in the model, such 
as Markovian dependencies, in a natural way. We demonstrate the 
performance of the proposed framework with results for image de- 
noising and classification tasks. 

Index Terms — Sparse coding, dictionary learning, MDL, de- 
noising, classification 

1. INTRODUCTION 

Sparse models are by now well established in a variety of fields and 
applications, including signal processing, machine learning, and sta- 
tistical inference, e.g. [ , , ] and references therein. 

When sparsity is a modeling device and not a hypothesis about 
the nature of the analyzed signals, parameters such as the desired 
sparsity in the solutions, or the size of the dictionaries to be learned, 
play a critical role in the effectiveness of sparse models for the tasks 
at hand. However, lacking theoretical guidelines for such param- 
eters, published applications based on learned sparse models often 
rely on either cross-validation or ad-hoc methods for learning such 
parameters (an exception for example being the Bayesian approach, 
e.g., [ ]) . Clearly, such techniques can be impractical and/or inef- 
fective in many cases. 

At the bottom of this problem lie fundamental questions such 
as: how rich or complex is a sparse model? how does this depend on 
the required sparsity of the solutions, or the size of the dictionaries? 
what is the best model for a given data class? A possible objective 
answer to such questions is provided by the Minimum Description 
Length principle (MDL) [5], a general methodology for assessing 
the ability of statistical models to capture regularity from data. The 
MDL principle is often regarded as a practical implementation of the 
Occam's razor principle, which states that, given two descriptions 
for a given phenomenon, the shorter one is usually the best. In a 
nutshell, MDL equates "ability to capture regularity" with "ability to 
compress" the data, and the metric with which models are measured 
in MDL is codelength or compressibility. 
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The idea of using MDL for sparse signal coding was explored in 
the context of wavelet-based image denoising [ , ]. These pioneer- 
ing works were restricted to denoising using fixed orthonormal basis 
(wavelets). In addition, the underlying probabilistic models used 
to describe the transform coefficients, which are the main technical 
choice to make when applying MDL to a problem, were not well 
suited to the actual statistical properties of the modeled data (image 
encoding coefficients), thus resulting in poor performance. Further- 
more, these works did not consider the critical effects of quantization 
in the coding, which needs to be taken into account when working 
in a true MDL framework (a useful model needs to be able to com- 
press, that is, it needs to produce actual codes which are shorter than 
a trivial description of the data). Finally, these models are designed 
for noisy data, with no provisions for the important case of noiseless 
data modeling, which addresses the errors due to deviations from the 
model, and is critical in many applications such as classification. 

The framework presented in this work addresses all of the above 
issues in a principled way: i) Efficient codes are used to describe 
the encoded data; ii) Deviations from the model are taken into ac- 
count when modeling approximation errors, thus being more general 
and robust than traditional sparse models; iii) Probability models for 
(sparse) transform coefficients are corrected to take into account the 
high occurrence of zeros; iv) Quantization is included in the model, 
and its effect is treated rigorously; v) Dictionary learning is formu- 
lated in a way which is consistent with the model's statistical as- 
sumptions. At the theoretical level, this brings us a step closer to the 
fundamental understanding of sparse models and brings a different 
perspective, that of MDL, into the sparse world. From a practical 
point of view, the resulting framework leads to coding and modeling 
algorithms which are completely parameter free and computation- 
ally efficient, and practically effective in a variety of applications. 

Another important attribute of the proposed family of models is 
that prior information can be easily and naturally introduced via the 
underlying probability models defining the codelengths. The effect 
of such priors can then be quickly assessed in terms of the new code- 
lengths obtained. For example, Markovian dependencies between 
sparse codes of adjacent image patches can be easily incorporated. 

2. BACKGROUND ON SPARSE MODELS 

Assume we are given n ra-dimensional data samples ordered as 
columns of a matrix Y = [yi |y2 1 • • • \y n ] G R mXn . Consider a 
linear model for Y, Y = DA + E, where D = [di |d 2 1 . . . |d p ] is 
anmxp dictionary consisting of p atoms, A = [ai|a2| . . . |a n ] G 
R pXn is a matrix of coefficients where each j-th column a^ specifies 
the linear combination of columns of D that approximates y j , and 
E = [ei|e 2 | . . . |e n ] G R mXn is a matrix of approximation errors. 
We say that the model is sparse if, for all or most j — 1 , . . . , n, we 
can achieve \\ej || 2 « while requiring that only a few coefficients 
7 <C p in a.j can be nonzero. 

The problem of sparsely representing Y in terms of D, which 



we refer to as the sparse coding problem, can be written as 

kj =argmin ||a|| s.t. Hy, -Da|| 2 < e, j = 1, . . . , n, (1) 

a 

where ||a|| = 7 is the pseudo-norm that counts the number of 
nonzero elements in a vector a, and e is some small constant. There 
is a body of results showing that the problem (1), which is non- 
convex and NP-hard, can be solved exactly when certain conditions 
on D and A are met, by either well known greedy methods such as 
Matching Pursuit [ ], or by solving a convex approximation to (1), 
commonly referred to as Basis Pursuit (BP) [9], 

a.j = argmin ||a|L s.t. ||yj — Da|L < e, j = 1, . . . , n. (2) 

a ' ~" 

When D is included as an optimization variable, we refer to the re- 
sulting problem as sparse modeling. This problem is often written 
in unconstrained form, 

n ^ 

(D, A) = argmin^ - ||y, - Da^ 2 + A Ha^, (3) 

' j=i 

where A > is an arbitrary constant. The problem in this case is 
non-convex in (D, A), and one must be contempt with finding local 
minima. Despite this drawback, in recent years, models learned by 
(approximately) minimizing (3) have shown to be very effective for 
signal analysis, leading to state-of-the-art results in several applica- 
tions such as image restoration and classification. 

2.1. Model complexity of sparse models 

In sparse modeling problems where D is learned, parameters such as 
the desired sparsity 7, the penalty A in (3), or the number of atoms 
p in D, must be chosen individually for each application and type 
of data to produce good results. In such cases, most sparse model- 
ing techniques end-up using cross-validation or ad-hoc techniques to 
select these critical parameters. An alternative formal path is to pos- 
tulate a Bayesian model where these parameters are assigned prior 
distributions, and such priors are adjusted through learning. This ap- 
proach, followed for example in [ ], adds robustness to the modeling 
framework, but leaves important issues unsolved, such as providing 
objective means to compare different models (with different priors, 
for example). The use of Bayesian sparse models implies having to 
repeatedly solve possibly costly optimization problems, increasing 
the computational burden of the applications. 

In this work we propose to use the MDL principle to formally 
tackle the problem of sparse model selection. The goal is twofold: 
for sparse coding with fixed dictionary, we want MDL to tell us the 
set of coefficients that gives us the shortest description of a given 
sample. For dictionary learning, we want to obtain the dictionary 
which gives us the shortest average description of all data samples 
(or a representative set of samples from some class). A detailed de- 
scription of such models, and the coding and modeling algorithms 
derived from them, is the subject of the next section. 

3. MDL-BASED SPARSE CODING AND MODELING 
FRAMEWORK 

Sparse models break the input data Y into three parts: a dictionary 
D, a set of coefficients A, and a matrix of reconstruction errors. In 
order to apply MDL to a sparse model, one must provide codelength 
assignments for these components, -L(A), L(D) and L(E), so that 
the total codelength L(Y) = L(E) + L(A) + L(D) can be com- 
puted. In designing such models, it is fundamental to incorporate as 



much prior information as possible so that no cost is paid in learning 
already known statistical features of the data, such as invariance to 
certain transformations or symmetries. Another feature to consider 
in sparse models is the predominance of zeroes in A. In MDL, all 
this prior information is embodied in the probability models used 
for encoding each component. What follows is a description of such 
models. 

Sequential coding- In the proposed framework, Y is encoded se- 
quentially, one column yj at a time, for j = 1, 2, . . . , n, possibly us- 
ing information (including dependencies) from previously encoded 
columns. However, when encoding each column yj, its sparse coef- 
ficients &j are modeled as and IID sequence (of fixed length p). 
Quantization- To achieve true compression, the finite precision of 
the input data Y must be taken into account. In the case of digital 
images for example, elements from Y usually take only 256 possi- 
ble values (from to 255 in steps of size 5 y = 1). Since there is 
no need to encode E with more precision than Y, we set the error 
quantization step to 5 e — S y . As for D and A, the corresponding 
quantization steps 5 a and 5 a need to be fine enough to produce fluc- 
tuations on DA which are smaller than the precision of Y, but not 
more. Therefore, the actual distributions used are discretized ver- 
sions of the ones discussed below. 

Error- The elements of E are encoded with an IID model where 
the random value of a coefficient (represented by an r.v. e) is the 
linear superposition of two effects: e = e + 77, where r\ ~ A/"(0, cr 2 ), 
cr 2 assumed known, models noise in Y due to measurement and/or 
quantization, and e ~ £(0, e ) is a zero mean, heavy-tailed (Lapla- 
cian) error due to the model. The resulting distribution for e, which 
is the convolution of the involved Laplacian and Gaussian distribu- 
tions, was developed in [10] under the name "LG." This model will 
be referred to as P e (• ; cr 2 , e ) hereafter. 

Sparse code- Each coefficient in A is modeled as the product of 
three (non-independent) random variables (see also [ ] for a related 
model), a = C ( / ) ( zy + where £ ~ Ber(p) is a support indi- 
cator, that is, C — 1 implies a ^ 0, <j> = sgn(a), and v — 
max{|a| — 5 a , 0} is the absolute value of a corrected for the fact 
that v > 5 a when £ = 1. Conditioned on C = 0, <f) — 1/ — with 
probability 1. Conditioned on £ = 1, we assume (j) ~ Ber(l/2), 
and v to be Exp(0 u ). Note that, with these choices, P(cj)is\( = 1) 
is a Laplacian, which is a standard model for transform (e.g., DCT, 
Wavelet) coefficients. The probability models for the variables £ and 
v will be denoted as P^(-; p) and P U (-;0 U ) respectively. 
Dictionary- We assume the elements of D to be uniformly dis- 
tributed on [—1,1]. Following the standard MDL recipe for en- 
coding model parameter values learned from n samples, we use 
a quantization step Sd — n -1 / 2 [ ]. For these choices we have 
L(D) = mp\og 2 n, which does not depend on the element values 
of D but only on the number of atoms p and the size of Y. Other 
possible models which impose structure in D, such as smoothness in 
the atoms, are natural to the proposed framework and will be treated 
in the extended version of this work. 

3.1. Universal models for unknown parameters 

The above probability models for the error e, support £ and (shifted) 
coefficient magnitude v depend on parameters which are not known 
in advance. In contrast with Bayesian approaches, cross-validation, 
or other techniques often used in sparse modeling, modern MDL 
solves this problem efficiently by means of the so called universal 
probability models [5]. In a nutshell, universal models provide op- 
timal codelengths using probability distributions of a known family, 
with unknown parameters, thus generalizing the classic results from 



Shannon theory [11]. 

Following this, we substitute ( • ; p) , P e ( • ; a% , 6 e ) and P„(-;6 V ) 
with corresponding universal models. For describing a given support 
z, we use an enumerative code [ ], which first describes the size of 
the support 7 with log 2 p bits, and then the particular arrangement 
of non-zeros in z using log 2 ( p ) bits. P e (-;a^O e ) and P U (-;0 V ) are 
substituted by corresponding universal mixture models, one of the 
possibilities dictated by the theory of universal modeling, 

r +00 



Algorithm 1 : Codelength-based Forward Selection. 



r+00 

Qe{u]o\ 1 ) = l w e (0)P e (u]a^, 
Jo 

r+00 

Qu(u)= w u (0)P u (u; 
Jo 



where the mixing functions w e (Q) and w a (0) are Gamma dis- 
tributions (the conjugate prior for the exponential distribution), 
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with fixed parameters 



(K,e,/3e) and (k U j/3u) respectively. The resulting Mixture of Expo- 
nentials (MOE) distribution Q I/ (-), is given by (see [13] for details), 



Q u (u\/3 u ,k u ) = k u /3Z u (u + p u ) 



-(K„+l) 



u G M + . Observing 



that the convolution and the convex mixture operations that result 
in Q e (-; 0%) are interchangeable (both integrals are finite), it turns 
out that Q e ( ; cr^) is a convolution of a MOE (of hyper-parameters 
(« e ,/3 e )) and a Gaussian with parameter cr^. Thus, although the 
explicit formula for this distribution, which we call MOEG, is cum- 
bersome, we can easily combine the results in [ ] for the LG, 
and for MOE in [ ] to perform tasks such as parameter estimation 
within this model. Note that the universality of these mixture models 
does not depend on the values of the hyper-parameters, and their 
choice has little impact on their overall performance. Here, guided 
by [13], we set n v — n e — 3.0, fi u = /3 e = S a . 

Following standard practice in MDL, the ideal Shannon code is 
used to translate probabilities into codelengths. Under this scheme, a 
sample value u with probability P(u) is assigned a code with length 
L(u) = — log P(u) (this is an ideal code because it only specifies a 
codelength, not a specific binary code, and because the codelengths 
produced can have a fractional number of bits). Then, for a given 
error residual e and coefficients magnitude vector v = [max{ \ca\ — 
S a , 0}]i=i,..., m , the respective ideal codelengths will be L e (e) = 

EHi- lo g2Q e ( e and M v ) = Efc=iz fc =i- lo g2Q^(>fc). Fi- 
nally, following the assumed Ber(l/2) model, the sign of each non- 
zero element in a is encoded using 1 bit, for a total of 7 bits. 

3.2. MDL-based sparse coding algorithms 

The goal of a coding algorithm in this framework is to obtain, for 
each j-th sample yj, a vector which minimizes its description, 

= argminL(e, a) = L e (e J -)+L^(z)+L(s|z) + L I/ (v|z). (4) 

a 

As it happens with most model selection algorithms, considering the 
support size (7 = ||aj|| ) explicitly in the cost function results in 
a non-convex, discontinuous objective function. A common proce- 
dure in sparse coding for this case is to estimate the optimum support 
zj for each possible support size, 7 = 1, 2, . . . ,p. Then, for each 
optimum support {zj : 7 = 1,..., p}, (4) is solved in terms of the 
corresponding non-zero values of a^ , yielding a candidate solution 
aj, and the one producing the smallest codelength is assigned to kj. 
As an example of this procedure, we propose Algorithm 1 , which is 
a variant of Matching Pursuit [ ]. As in [ ], we start with a° = 0, 
adding one new atom to the active set in each iteration. However, 
instead of adding the atom that is maximally correlated with the cur- 
rent residual e to the active set, we add the one yielding the largest 



Input: Data sample y, dictionary D 

Output: The sparse code for y, a 

initialize a <(— 0; e <(— y; L <(— 1/(0); z <s— ; 

initialize g <(— D T e ; //correlation of current error with the dictionary 
repeat 

for k = 1 , . . . , p : Zk = do 

Afc [9k] s a ' //step Ak is correlation, quantized to prec. 5 a 
Lk ^— L(a + AkOJk) ; //ojk is the k-th canonical vector ofR p 

end 

Choose /c = arg mirifc :2fc= o{I/fc }; 
if L k < Lthen 

L <(— ; // update current smallest codelength 

Zfe <(— 1 ; // update support vector 

a <(— a + ^k^k ' //update coefficients vector 

g^— g — A^,d^,; // update correlation 

end 

until Lt > L ; 



decrease in overall codelength. The algorithm stops when no further 
decrease in codelength is obtained by adding a new atom. 

An alternative for estimating {zj : 7 = 1, ... ,p} is to use a 
convex model selection algorithm such as LARS/Lasso [ ], which 
also begins with a° = 0, adding one atom at a time to the solution. 
For this case we propose to substitute the £2 loss in LARS/Lasso 
by — logLG(-), which is more consistent with (4) and can be ap- 
proximated by the Huber loss function [15]. This alternative will be 
discussed in detail in the extended version of this work. 

3.3. Dictionary learning 

Dictionary learning in the proposed framework proceeds in two 
stages. In the first one, a maximum dictionary size p ma x is fixed 
and the algorithm learns D using alternate minimization in A and 
D, as in standard dictionary learning approaches, now with the new 
codelength-based metric. First, D is fixed and A is updated as 
described in Section 3.2. Second, keeping A fixed, the update of 
D reduces to argminD L e (Y — DA) (recall that L(D) depends 
only on p, thus being constant at this stage). We add the constraint 



that lldjfc 



< 1,/c = 1, 



, p, a standard form of dictionary regu- 



larization. Here too we approximate L e (-) with the Huber function, 
obtaining a convex, differentiable dictionary update step which can 
be efficiently solved using scaled projected gradient. In practice, this 
function produces smaller codelengths than using standard £2 -based 
dictionary update, at the same computational cost (see Section 4). 

In the second stage, the size of the dictionary is optimized by 
pruning atoms whose presence in D actually results in an increased 
average codelength. In practice, the final size of the dictionary re- 
flects the intuitive complexity of the data to be modeled, thus vali- 
dating the approach (see examples in Section 4). 

4. RESULTS AND CONCLUSION 

The first experiment assesses that the learned models produce com- 
pressed descriptions of natural images. For this, we adapted a dic- 
tionary to the Pascal'06 image database 1 , and encoded several of its 
images. The average bits per pixel obtained was 4.08 bits per pixel 
(bpp), with p = 250 atoms in the final dictionary. We repeated this 
using £2 instead of Huber loss, obtaining 4.12 bpp andp = 245. 

We now show example results obtained with our framework in 
two very different applications. In both cases we exploit spatial cor- 



1 http://pascallin.ecs.soton.ac. uk/ challenges/ 
VOC/databases . html 
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Fig. 1. Denoising results. Left to right: PSNR of denoised images 
for different methods and noise levels (including [ ] as a reference), 
sample recovered patch from Barbara, best dictionary for Barbara 
(p = 200, we used an initial p ma x = 512). 



relation between codes of adjacent patches by learning Markovian 
dependencies between their supports (see [16] for related work in 
the Bayesian framework). More specifically, we condition the prob- 
ability of occurrence of an atom at a given position in the image, on 
the occurrence of that same atom in the left, top, and top-left patches. 
Note that, in both applications, the results were obtained without the 
need to adjust any parameter (although 5 a is a parameter of Algo- 
rithm 1, it was fixed beforehand to a value that did not introduce 
noticeable additional distortion in the model, and left untouched for 
the experiments shown here). 

The first task is to estimate a clean image from an observed noisy 
version corrupted by Gaussian noise of known variance a^. Here Y 
contains all (overlapping) 8x8 patches from the noisy image. First, 
a dictionary D is learned from the noisy patches. Then each patch is 
encoded using D with a denoising variant of Algorithm 1 , where the 
stopping criterion is changed for the requirement that the observed 
distortion falls within a ball of radius ^/mo^ . Finally, the estimated 
patches are overlapped again and averaged to form the denoised im- 
age. From the results shown in Figure 1, it can be observed that 
adding a Markovian dependency between patches consistently im- 
proves the results. Note also that in contrast with [ ], these results 
are fully parameter free, while those in [ ] significantly depend on 
carefully tuned parameters. 

The second application is texture segmentation via patch clas- 
sification. Here we are given c images with sample textures, and a 
target mosaic of textures, and the task is to assign each pixel in the 
mosaic to one of the textures. Again, all images are decomposed 
into overlapping patches. This time a dictionary D r is learned for 
each texture r = 1 , . . . , c using the patches from the training im- 
ages. Then, each patch in the mosaic is encoded using all available 
dictionaries, and its center pixel is assigned to the class which pro- 
duced the shortest description length for that patch. The final result 
includes a simple 3x3 median filter to smooth the segmentation. 
We show a sample result in Figure 2. Here, again, the whole process 
is parameter free, and adding Markovian dependency improves the 
overall error rate (7.5% against 8.5%). 

In summary, we have presented an MDL-based sparse modeling 
framework, which automatically adapts to the inherent complexity of 
the data at hand using codelength as a metric. As a result, the frame- 
work can be applied out-of-the-box to very different applications, 
obtaining competitive results in all the cases presented. We have 
also shown how prior information, such as spatial dependencies, are 
easily added to the framework by means of probability models. 
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